AITopics | imputation performance

Neural Information Processing Systems | Feb-17-2026, 01:56:11 GMT

Double and Single Descent in Causal Inference with an Application to High-Dimensional Synthetic Control

Motivated by a recent literature on the double-descent phenomenon in machine learning, we consider highly over-parameterized models in causal inference, including synthetic control with many control units.

artificial intelligence, machine learning, synthetic control, (17 more...)

Country:

North America > United States > District of Columbia > Washington (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Industry:

Health & Medicine (0.67)
Government (0.46)
Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.51)

arXiv.org Machine Learning | Jan-21-2026

Statistical-Neural Interaction Networks for Interpretable Mixed-Type Data Imputation

Deng, Ou, Nishimura, Shoji, Ogihara, Atsushi, Jin, Qun

Real-world tabular databases routinely combine continuous measurements and categorical records, yet missing entries are pervasive and can distort downstream analysis. We propose Statistical-Neural Interaction (SNI), an interpretable mixed-type imputation framework that couples correlation-derived statistical priors with neural feature attention through a Controllable-Prior Feature Attention (CPFA) module. CPFA learns head-wise prior-strength coefficients $\{λ_h\}$ that softly regularize attention toward the prior while allowing data-driven deviations when nonlinear patterns appear to be present in the data. Beyond imputation, SNI aggregates attention maps into a directed feature-dependency matrix that summarizes which variables the imputer relied on, without requiring post-hoc explainers. We evaluate SNI against six baselines (Mean/Mode, MICE, KNN, MissForest, GAIN, MIWAE) on six datasets spanning ICU monitoring, population surveys, socio-economic statistics, and engineering applications. Under MCAR/strict-MAR at 30\% missingness, SNI is generally competitive on continuous metrics but is often outperformed by accuracy-first baselines (MissForest, MIWAE) on categorical variables; in return, it provides intrinsic dependency diagnostics and explicit statistical-neural trade-off parameters. We additionally report MNAR stress tests (with a mask-aware variant) and discuss computational cost, limitations -- particularly for severely imbalanced categorical targets -- and deployment scenarios where interpretability may justify the trade-off.
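The evaluation protocol mentioned in this abstract (hide 30% of entries completely at random, impute, score on the hidden cells) can be sketched with the simplest listed baseline, mean imputation. This is only an illustrative sketch under stated assumptions: the table is synthetic, the helper names `mcar_mask`, `mean_impute`, and `masked_rmse` are hypothetical, and nothing here reproduces SNI or the paper's actual setup.

```python
import random

def mcar_mask(n_rows, n_cols, rate, seed=0):
    """Return the set of (row, col) cells to hide, chosen uniformly at random (MCAR)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    n_missing = round(rate * len(cells))
    return set(rng.sample(cells, n_missing))

def mean_impute(table, hidden):
    """Fill hidden cells with the mean of the observed values in the same column."""
    n_rows, n_cols = len(table), len(table[0])
    filled = [row[:] for row in table]
    for c in range(n_cols):
        observed = [table[r][c] for r in range(n_rows) if (r, c) not in hidden]
        # Fall back to 0.0 in the unlikely case that a whole column is hidden.
        col_mean = sum(observed) / len(observed) if observed else 0.0
        for r in range(n_rows):
            if (r, c) in hidden:
                filled[r][c] = col_mean
    return filled

def masked_rmse(table, filled, hidden):
    """RMSE computed only on the cells that were hidden."""
    sq = [(table[r][c] - filled[r][c]) ** 2 for r, c in hidden]
    return (sum(sq) / len(sq)) ** 0.5

# Hypothetical 20 x 5 table of continuous values (not data from the paper).
data_rng = random.Random(1)
table = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20)]

hidden = mcar_mask(20, 5, rate=0.30)
filled = mean_impute(table, hidden)
score = masked_rmse(table, filled, hidden)
print(f"hidden cells: {len(hidden)}, masked RMSE: {score:.3f}")
```

Scoring only on the hidden cells keeps the metric from being diluted by entries the imputer never had to guess.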

data mining, machine learning, missingness, (21 more...)

arXiv.org Machine Learning

2601.1238

Country: Asia > Japan > Honshū > Kantō (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Ahamed, Md Atik, Ye, Qiang, Cheng, Qiang

RefiDiff: Progressive Refinement Diffusion for Efficient Missing Data Imputation

arXiv.org Artificial Intelligence | Nov-13-2025

Missing values in high-dimensional, mixed-type datasets pose significant challenges for data imputation, particularly under Missing Not At Random (MNAR) mechanisms. Existing methods struggle to integrate local and global data characteristics, limiting performance in MNAR and high-dimensional settings. We propose an innovative framework, RefiDiff, combining local machine learning predictions with a novel Mamba-based denoising network efficiently capturing long-range dependencies among features and samples with low computational complexity. RefiDiff bridges the predictive and generative paradigms of imputation, leveraging pre-refinement for initial warm-up imputations and post-refinement to polish results, enhancing stability and accuracy. By encoding mixed-type data into unified tokens, RefiDiff enables robust imputation without architectural or hyperparameter tuning. RefiDiff outperforms state-of-the-art (SOTA) methods across missing-value settings, demonstrating strong performance in MNAR settings and superior out-of-sample generalization. Extensive evaluations on nine real-world datasets demonstrate its robustness, scalability, and effectiveness in handling complex missingness patterns.

data mining, imputation, machine learning, (18 more...)

2505.14451

Country:

North America > United States > Kentucky (0.40)
North America > United States > California (0.05)
Asia > Taiwan (0.04)

Genre: Research Report > New Finding (0.92)

Industry:

Information Technology (0.67)
Health & Medicine (0.46)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(3 more...)

Neural Information Processing Systems | Oct-10-2025, 12:47:42 GMT

Learning from Highly Sparse Spatio-temporal Data

Incomplete spatio-temporal data in the real world has spawned much research.

dataset, information, st point, (17 more...)

Country:

Europe > Spain > Galicia > Madrid (0.04)
Asia > China (0.04)
Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Energy (0.46)

Technology:

Information Technology > Data Science > Data Mining (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.67)

Neural Information Processing Systems | Oct-9-2025, 07:17:20 GMT

c904c5d43d8a01177063977bd67bf6fc-Paper-Conference.pdf

artificial intelligence, machine learning, synthetic control, (17 more...)

Country:

North America > United States > District of Columbia > Washington (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Industry:

Health & Medicine (0.67)
Government (0.46)
Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.51)

arXiv.org Artificial Intelligence | Sep-30-2025

Impute-MACFM: Imputation based on Mask-Aware Flow Matching

Liu, Dengyi, Wang, Honggang, Fang, Hua

Tabular data are central to many applications, especially longitudinal data in healthcare, where missing values are common, undermining model fidelity and reliability. Prior imputation methods either impose restrictive assumptions or struggle with complex cross-feature structure, while recent generative approaches suffer from instability and costly inference. We propose Impute-MACFM, a mask-aware conditional flow matching framework for tabular imputation that addresses missingness mechanisms, missing completely at random, missing at random, and missing not at random. Its mask-aware objective builds trajectories only on missing entries while constraining predicted velocity to remain near zero on observed entries, using flexible nonlinear schedules. Impute-MACFM combines: (i) stability penalties on observed positions, (ii) consistency regularization enforcing local invariance, and (iii) time-decayed noise injection for numeric features. Inference uses constraint-preserving ordinary differential equation integration with per-step projection to fix observed values, optionally aggregating multiple trajectories for robustness. Across diverse benchmarks, Impute-MACFM achieves state-of-the-art results while delivering more robust, efficient, and higher-quality imputation than competing approaches, establishing flow matching as a promising direction for tabular missing-data problems, including longitudinal data.
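The inference step described in this abstract (ODE integration with a per-step projection that fixes observed values) can be illustrated with a toy Euler integrator. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_velocity` is a hypothetical stand-in for a learned velocity field, and the data, step count, and function names are invented for illustration.

```python
def toy_velocity(x, t):
    """Hypothetical stand-in for a learned velocity field: pulls every entry toward 1.0."""
    return [1.0 - xi for xi in x]

def integrate_with_projection(x0, observed, mask, velocity, steps=50):
    """Euler-integrate dx/dt = velocity(x, t) on [0, 1] with a per-step projection.

    mask[i] is True where entry i is observed and False where it is missing.
    After every step, observed entries are reset to their known values.
    """
    # Start from the known values where observed and from x0 elsewhere.
    x = [o if m else xi for xi, o, m in zip(x0, observed, mask)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        # Projection: overwrite observed positions with their known values.
        x = [o if m else xi for xi, o, m in zip(x, observed, mask)]
    return x

observed = [0.2, 0.0, 0.7, 0.0]   # entries at missing positions are placeholders
mask = [True, False, True, False]
x0 = [0.0, -0.5, 0.0, 0.3]        # hypothetical initial noise sample

imputed = integrate_with_projection(x0, observed, mask, toy_velocity)
print(imputed)
```

Whatever the velocity field does, the projection guarantees the observed entries come out unchanged, so only the missing positions are actually generated.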

imputation performance, machine learning, natural language, (16 more...)

2509.23126

Country: North America > United States > Massachusetts (0.46)

Genre: Research Report (0.50)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(2 more...)

Gupta, Vaibhav, Maleshkova, Maria

Evaluating Imputation Techniques for Short-Term Gaps in Heart Rate Data

arXiv.org Artificial Intelligence | Aug-13-2025

Recent advances in wearable technology have enabled the continuous monitoring of vital physiological signals, essential for predictive modeling and early detection of extreme physiological events. Among these physiological signals, heart rate (HR) plays a central role, as it is widely used in monitoring and managing cardiovascular conditions and detecting extreme physiological events such as hypoglycemia. However, data from wearable devices often suffer from missing values. To address this issue, recent studies have employed various imputation techniques. Traditionally, the effectiveness of these methods has been evaluated using predictive accuracy metrics such as RMSE, MAPE, and MAE, which assess numerical proximity to the original data. While informative, these metrics fail to capture the complex statistical structure inherent in physiological signals. This study bridges this gap by presenting a comprehensive evaluation of four statistical imputation methods, linear interpolation, K Nearest Neighbors (KNN), Piecewise Cubic Hermite Interpolating Polynomial (PCHIP), and B splines, for short term HR data gaps. We assess their performance using both predictive accuracy metrics and statistical distance measures, including the Cohen Distance Test (CDT) and Jensen Shannon Distance (JS Distance), applied to HR data from the D1NAMO dataset and the BIG IDEAs Lab Glycemic Variability and Wearable Device dataset. The analysis reveals limitations in existing imputation approaches and the absence of a robust framework for evaluating imputation quality in physiological signals. Finally, this study proposes a foundational framework to develop a composite evaluation metric to assess imputation performance.
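To make the two families of metrics in this abstract concrete, the sketch below fills a short gap by linear interpolation (one of the four evaluated methods) and scores it with RMSE, MAE, MAPE, and a histogram-based Jensen-Shannon distance. It is an illustrative sketch only: the heart-rate values, bin edges, and helper names are hypothetical, and the study's actual pipeline and datasets are not reproduced here.

```python
import math

def linear_interpolate(series):
    """Fill None gaps linearly between the nearest observed neighbours.

    Leading or trailing gaps are filled with the nearest observed value.
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled

def rmse(truth, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def mae(truth, pred):
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def mape(truth, pred):
    # Assumes no true value is zero, which holds for heart rate.
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(truth, pred)) / len(truth)

def histogram(values, edges):
    """Normalised counts over half-open bins; values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]

def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

truth = [70.0, 72.0, 75.0, 79.0, 80.0, 78.0]   # hypothetical HR samples in bpm
gappy = [70.0, 72.0, None, None, 80.0, 78.0]   # a short two-sample gap
filled = linear_interpolate(gappy)

gap_idx = [i for i, v in enumerate(gappy) if v is None]
t = [truth[i] for i in gap_idx]
p = [filled[i] for i in gap_idx]
edges = [60.0, 70.0, 80.0, 90.0]               # hypothetical 10-bpm bins
jsd = js_distance(histogram(truth, edges), histogram(filled, edges))
print(rmse(t, p), mae(t, p), mape(t, p), jsd)
```

Pointwise errors are computed on the gap only, while the distributional distance compares the whole series, which is why the two kinds of metric can disagree about the same imputation.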

artificial intelligence, imputation technique, machine learning, (14 more...)

2508.08268

Genre: Research Report > New Finding (0.47)

Industry:

Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.55)

arXiv.org Artificial Intelligence | Mar-28-2025

MNT-TNN: Spatiotemporal Traffic Data Imputation via Compact Multimode Nonlinear Transform-based Tensor Nuclear Norm

Lu, Yihang, Yousaf, Mahwish, Meng, Xianwei, Chen, Enhong

Imputation of random or non-random missing data is a long-standing research topic and a crucial application for Intelligent Transportation Systems (ITS). However, with the advent of modern communication technologies such as Global Navigation Satellite Systems (GNSS), traffic data collection has outpaced traditional methods, introducing new challenges in random missing value imputation and increasing demands for spatiotemporal dependency modeling. To address these issues, we propose a novel spatiotemporal traffic imputation method, Multimode Nonlinear Transformed Tensor Nuclear Norm (MNT-TNN), grounded in the Transform-based Tensor Nuclear Norm (TTNN) optimization framework which exhibits efficient mathematical representations and theoretical guarantees for the recovery of random missing values. Specifically, we strictly extend the single-mode transform in TTNN to a multimode transform with nonlinear activation, effectively capturing the intrinsic multimode spatiotemporal correlations and low-rankness of the traffic tensor, represented as location $\times$ location $\times$ time. To solve the nonconvex optimization problem, we design a proximal alternating minimization (PAM) algorithm with theoretical convergence guarantees. We suggest an Augmented Transform-based Tensor Nuclear Norm Families (ATTNNs) framework to enhance the imputation results of TTNN techniques, especially at very high miss rates. Extensive experiments on real datasets demonstrate that our proposed MNT-TNN and ATTNNs can outperform the compared state-of-the-art imputation methods, completing the benchmark of random missing traffic value imputation.

data mining, machine learning, tensor, (19 more...)

2503.22955

Country:

Asia > China > Anhui Province > Hefei (0.04)
North America > United States > California (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Ohio > Franklin County > Columbus (0.04)

Genre: Research Report (1.00)

Industry:

Transportation (1.00)
Media > Television (0.86)
Leisure & Entertainment (0.86)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Sundsgaard, Konrad, Bölat, Kutay, Yang, Guangya

Data Enrichment Opportunities for Distribution Grid Cable Networks using Variational Autoencoders

arXiv.org Artificial Intelligence | Jan-18-2025

Electricity distribution cable networks suffer from incomplete and unbalanced data, hindering the effectiveness of machine learning models for predictive maintenance and reliability evaluation. Features such as the installation date of the cables are frequently missing. To address data scarcity, this study investigates the application of Variational Autoencoders (VAEs) for data enrichment, synthetic data generation, imbalanced data handling, and outlier detection. Based on a proof-of-concept case study for Denmark, targeting the imputation of missing age information in cable network asset registers, the analysis underlines the potential of generative models to support data-driven maintenance. However, the study also highlights several areas for improvement, including enhanced feature importance analysis, incorporating network characteristics and external features, and handling biases in missing data. Future initiatives should expand the application of VAEs by incorporating semi-supervised learning, advanced sampling techniques, and additional distribution grid elements, including low-voltage networks, into the analysis.

artificial intelligence, machine learning, vae, (18 more...)

2501.1092

Country:

Europe > Netherlands > South Holland > Delft (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)
Europe > Spain > Galicia > Madrid (0.04)
Europe > Italy > Lazio > Rome (0.04)

Genre: Research Report (0.50)

Industry: Energy > Power Industry (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)